We are a team of athletes and sports lovers, so we were interested in using data to research a problem in the world of professional sports. We decided to examine NFL injuries: particular injuries are common in professional football, and their incidence and severity play a large role in determining players’ return to play and a team’s success. Injuries don’t occur at random; we would like to deduce some of the patterns underlying injury incidence to better understand the risks involved for players and teams.
In particular, we would like to examine which positions are most at risk for injuries, which injuries are most common, which teams have the most injuries (and if these teams are consistently the same each year), and whether injury incidence has evolved over time (and in particular, if concussion rates have decreased following the introduction of new helmet technology in 2017).
We were initially interested in examining 7 questions, stated below:
We ultimately decided to only consider questions 1-4, for a few reasons.
For question 5, we ran into a time constraint: because we built our own data for this project by web scraping pro-football-reference.com instead of using a pre-built dataset, finding and building the additional data that would be needed to answer our question about weather conditions would have cost time that was better spent answering questions 1-4 to the best of our abilities.
We decided against studying question 6 because, as we began to work more closely with our dataset, we realized that the structure of our injury data source did not lend itself to answering our question about injury duration. It was not possible to reliably discern how long each athlete was sidelined by one particular injury because, for example, athletes commonly had multiple injuries at once. Moreover, there was no way to know the duration of injuries incurred at the end of the season that healed in the off-season.
As for question 7, we realized that this question could be subsumed by other questions under consideration. For example, in our analysis of question 1 we ran a logistic regression predicting binary injury occurrence using the covariates under study in question 7.
1) Data source

Our data source was pro-football-reference.com. This website publishes a report of injuries for each NFL team, with a column for every game played by the team and a row for each player who was injured during the season.
2) Web scraping method
We wrote a web scraping script that iterated across each year available on the website (2009-2021) and across each NFL team, read all information available from each injury table, and saved it in an R object.
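A minimal sketch of this loop is below. The rvest calls are standard, but the URL pattern (`/teams/<abbr>/<year>_injuries.htm`) and the table selector are assumptions about the site's layout at the time of scraping and may need adjusting.

```r
# Sketch of the scraping loop; the URL pattern and table selector are
# assumptions about pro-football-reference.com's layout (verify before use)
build_injury_url <- function(team, year) {
  paste0("https://www.pro-football-reference.com/teams/",
         team, "/", year, "_injuries.htm")
}

scrape_team_year <- function(team, year) {
  rvest::read_html(build_injury_url(team, year)) |>
    rvest::html_element("table") |>   # first (injury) table on the page
    rvest::html_table()
}

# Loop over every year for a given team, e.g.:
# tables <- lapply(2009:2021, function(y) scrape_team_year("htx", y))
```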
3) Data cleaning
a) Data shaping
In order to conduct analyses on player injuries, injury tables were transformed into a table with one row per injured player per year. To do this, we first selected only the regular season games from each table to give equivalent estimates for teams who made it to the playoffs. We then counted the number of games a player was listed as injured (the sum of non-blank entries in the table) and the number of games a player was listed as not playing. In order to describe injuries, we concatenated all unique injury descriptions for a given player into one string.
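A toy, base R sketch of this per-player reduction (the column names here are hypothetical stand-ins for the scraped table):

```r
# Toy version of a scraped injury table: one row per player,
# one column per regular-season game, "" = not listed that week
wk <- data.frame(
  player = c("Smith", "Jones"),
  g1 = c("knee", ""),
  g2 = c("knee", "ankle"),
  g3 = c("", "ankle"),
  stringsAsFactors = FALSE
)
games <- c("g1", "g2", "g3")

# One row per player: count of games listed as injured, plus the
# concatenation of unique injury descriptions
per_player <- data.frame(
  player        = wk$player,
  games_injured = rowSums(wk[games] != ""),
  injuries      = apply(wk[games], 1, function(x)
    paste(unique(x[x != ""]), collapse = " "))
)
per_player
```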
Additional player demographic data were provided by gridironai.com. Because this dataset had one row for every player for every week, duplicates were removed so that there was only one row per player per season.
b) Injury classification
Once the data were scraped, there were issues with the free text in the injury column. To answer our primary questions, we needed to clean the data in such a way that the injuries could be easily modeled and analyzed. Given how many distinct injuries were initially reported, as seen in the exploratory analysis below, we grouped the injuries into 8 main categories based on the part of the body affected: Head, Shoulder, Upper Torso, Lower Torso, Arm, Hand, Leg, and Foot. This made the distribution of injuries much easier to evaluate and draw conclusions from, while still keeping the injury data meaningful for each player. We then cleaned the free text by removing the special characters separating injuries from one another and replacing them with a space. Many injuries were read as one large string, such as “knee arm concussion head”. To deal with this, we wrote a function that re-codes each specific injury into its appropriate category; for example, the string above would be re-coded as “leg arm head head”, so this player would be classified as having 1 leg injury, 1 arm injury, and 2 head injuries. Once the injuries were re-coded, we used the mutate() function to create count columns for each of the 8 body-part categories, summing how many of those injuries each player had, and then used the group_by() and summarise() functions to obtain the total counts of each body-part injury. This left us with a usable data set for answering our primary questions.
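The re-coding step can be sketched in base R as below. The mapping shown is an illustrative subset; the project's actual map covered every free-text term observed in the data.

```r
# Hypothetical re-coding map (illustrative subset of the full map)
body_part <- c(knee = "leg", hamstring = "leg", ankle = "foot",
               concussion = "head", wrist = "hand")

# Replace each free-text injury term with its body-part category;
# terms already matching a category (e.g. "arm", "head") pass through
recode_injuries <- function(x) {
  for (term in names(body_part)) {
    x <- gsub(paste0("\\b", term, "\\b"), body_part[[term]], x)
  }
  x
}

recode_injuries("knee arm concussion head")  # "leg arm head head"
```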
We conducted exploratory data analysis on each of our 4 questions of interest separately. Below we will describe initial analysis and results for each question.
To answer this question, we used the web scraped injury and player demographic data. We merged these datasets and created factors where appropriate. We then created a binary injury variable that had a value of “1” for injured players and a “0” for non-injured players. The summary statistics below were computed for each position over all the years:
| position_id | mean | sd | Q1 | median | Q3 |
|---|---|---|---|---|---|
| DEF | 511.8 | 130.1 | 413 | 603 | 617 |
| K | 10.38 | 5.009 | 10 | 11 | 12 |
| OL | 192.6 | 40.31 | 180 | 199 | 210 |
| P | 6.769 | 3.7 | 4 | 8 | 10 |
| QB | 32.69 | 12.01 | 24 | 36 | 41 |
| RB | 95.69 | 24.68 | 81 | 107 | 115 |
| TE | 66.85 | 15.79 | 62 | 69 | 77 |
| WR | 130.3 | 25.03 | 116 | 131 | 149 |
We could interpret the mean value for defensive players (DEF) by stating that about 512 DEF players are injured per season. Although informative, this approach does not provide insight into the number of injuries that a defensive player could expect to incur in a given season. To obtain this information, we created a new variable that counts the total number of injuries for each player. We then grouped by position and year to find the total number of injuries for each position. This value was then divided by the number of players and then the number of seasons to produce the average injuries per player per year, broken down by position. The only complication here is that while our dataset contains all NFL players who were injured, we do not have a complete count of all players who were not injured, meaning that we do not have an accurate count of players who had zero injuries. To account for this, we will instead find the average number of injuries per player per season among players who are injured, or the expected number of injuries for a player conditioned on them being injured at least once. These results as well as other summary statistics are presented below.
| position_id | avg_total_injuries | SD | Q1 | Median | Q3 | Min | Max |
|---|---|---|---|---|---|---|---|
| DEF | 1.643 | 1.035 | 1 | 1 | 2 | 1 | 16 |
| K | 1.17 | 0.3774 | 1 | 1 | 1 | 1 | 2 |
| OL | 1.565 | 0.9145 | 1 | 1 | 2 | 1 | 9 |
| P | 1.205 | 0.4589 | 1 | 1 | 1 | 1 | 3 |
| QB | 1.689 | 1.405 | 1 | 1 | 2 | 1 | 21 |
| RB | 1.711 | 1.107 | 1 | 1 | 2 | 1 | 10 |
| TE | 1.623 | 1.031 | 1 | 1 | 2 | 1 | 13 |
| WR | 1.697 | 1.042 | 1 | 1 | 2 | 1 | 9 |
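The conditional averages above follow the pattern sketched here on toy data: group by position, then average total injuries over injured players only.

```r
# Average injuries per injured player by position (toy data; the real
# analysis also grouped by year before averaging)
inj <- data.frame(
  position_id    = c("QB", "QB", "RB", "RB", "RB"),
  total_injuries = c(1, 3, 1, 2, 1)
)
avg_by_pos <- aggregate(total_injuries ~ position_id, data = inj, FUN = mean)
avg_by_pos
```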
Now we can state that the expected number of injuries for an injured DEF player is about 1.6 per season, while an injured quarterback (QB) can expect about 1.7 injuries per season.
In addition to total injury counts over the total 2009-2021 period, we are also concerned with the evolution of these counts over time. We chart this below.
We chose to use a log scale on the y-axis because this makes it easier to see the lines towards the bottom of the graph, which all overlap without the log scale. When viewing the plot, it is important to note that some positions have more players on the field than others; for example, there are eleven defensive players on the field at a given time but only one quarterback. This suggests that scaling should be performed to account for the different group sizes. Scaling by the number of players in each position gives essentially the average number of injuries per player per position. As in the table above, we can only calculate the expected number of injuries among players who were injured at least once. The plot of these averages over time is presented below:
Note that none of these values are less than 1, because all values are conditioned on players being injured at least once.
Another important part of our analysis was checking for missing data. The table below shows that we had a small number of missing measurements for player height/weight/bmi. These values were removed prior to the analyses.
| injury | position_id | height_inches | weight_pounds | game_starter | age | bmi | year |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 13 | 11 | 0 | 0 | 13 | 0 |
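A check like this can be produced by counting NAs per column; a toy sketch:

```r
# Missing-data check: NA counts per column (toy data for illustration)
players <- data.frame(
  height_inches = c(72, NA, 70),
  weight_pounds = c(210, 195, NA),
  age           = c(25, 31, 28)
)
na_counts <- colSums(is.na(players))
na_counts
```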
After completing our initial EDA, we turned our attention to the classification task of predicting whether a player will be injured in a given season given their position, as well as other player information (e.g., height, weight, age, team). And, where possible, we sought to interpret odds ratios explaining the relationship between player position and injury risk. The three main classification algorithms we explored were logistic regression, k-Nearest Neighbors (kNN), and Random Forest. These methods were selected due to the categorical nature of our outcome (injured or not), and because of the varying degrees of flexibility afforded by these approaches. We first proceed with exploratory analysis to assess the suitability of logistic regression for this task.
From here, we opted to perform forward selection as a more systematic way of finding good features for the model. Forward selection begins by regressing the outcome variable on just the intercept, and then iteratively adds whichever covariate yields the lowest AIC. The final results of this process appear below:
##
## Call:
## glm(formula = injury ~ 1, family = binomial(), data = players)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.20 -1.20 1.16 1.16 1.16
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.0423 0.0123 3.45 0.00055 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 36933 on 26649 degrees of freedom
## Residual deviance: 36933 on 26649 degrees of freedom
## AIC: 36935
##
## Number of Fisher Scoring iterations: 3
##
## Call:
## glm(formula = injury ~ game_starter + year + position_id + age +
## weight_pounds + I(age^2), family = binomial(), data = players)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.016 -1.074 0.613 1.056 2.243
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.46e+02 7.27e+00 33.86 < 2e-16 ***
## game_starter 1.03e+00 2.77e-02 37.36 < 2e-16 ***
## year -1.23e-01 3.59e-03 -34.33 < 2e-16 ***
## position_idDEF 5.21e-02 5.54e-02 0.94 0.3471
## position_idK -1.01e+00 1.18e-01 -8.54 < 2e-16 ***
## position_idOL -1.48e-01 6.21e-02 -2.39 0.0170 *
## position_idP -1.33e+00 1.33e-01 -10.00 < 2e-16 ***
## position_idQB -7.61e-01 8.67e-02 -8.77 < 2e-16 ***
## position_idRB 1.90e-01 6.94e-02 2.75 0.0060 **
## position_idWR 1.30e-01 6.74e-02 1.94 0.0529 .
## age 1.45e-01 4.66e-02 3.10 0.0019 **
## weight_pounds -2.22e-03 3.84e-04 -5.79 7.2e-09 ***
## I(age^2) -1.92e-03 8.33e-04 -2.30 0.0212 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 36933 on 26649 degrees of freedom
## Residual deviance: 33561 on 26637 degrees of freedom
## AIC: 33587
##
## Number of Fisher Scoring iterations: 4
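This kind of forward selection can be reproduced with base R's step(), sketched here on the built-in mtcars data so the snippet runs standalone; we do not claim this is the exact code used for the player models above.

```r
# Forward selection by AIC with base R's step(), illustrated on
# built-in data (the same pattern applies to the `players` data)
null_fit <- glm(am ~ 1, family = binomial(), data = mtcars)
fwd_fit  <- step(null_fit,
                 scope     = ~ hp + wt + qsec,  # candidate covariates
                 direction = "forward",
                 trace     = 0)
summary(fwd_fit)
```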
We see that game starter (yes/no), year, position, age, age squared, and weight (lbs) were the most useful predictors. Age squared was added as a potential predictor because age and injury status may have a quadratic relationship, where the risk of injury increases as players get older but then decreases after a certain age, since few people play football beyond 40. After completing the initial variable screening, we began fitting the machine learning models. The data were divided into training and test sets, with the former receiving 70% of the data and the latter the remaining 30%. The confusion matrix, accuracy, sensitivity, and specificity of each model are presented below. The k parameter for kNN was found by two-fold cross-validation; we chose k = 21 as this is the point where the accuracy began to level off.
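As a self-contained sketch of the split-and-evaluate workflow (base R on built-in data; the actual analysis used the player data, and the confusion matrices below have the format produced by caret's confusionMatrix()):

```r
# 70/30 train/test split and accuracy calculation, sketched in base R
set.seed(1)
n         <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.7 * n))
train_set <- mtcars[train_idx, ]
test_set  <- mtcars[-train_idx, ]

fit  <- glm(am ~ wt + hp, family = binomial(), data = train_set)
prob <- predict(fit, newdata = test_set, type = "response")
pred <- as.integer(prob > 0.5)   # classify at the 0.5 threshold

accuracy <- mean(pred == test_set$am)
```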
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2347 1325
## 1 1566 2757
##
## Accuracy : 0.638
## 95% CI : (0.628, 0.649)
## No Information Rate : 0.511
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.276
##
## Mcnemar's Test P-Value : 8.06e-06
##
## Sensitivity : 0.675
## Specificity : 0.600
## Pos Pred Value : 0.638
## Neg Pred Value : 0.639
## Prevalence : 0.511
## Detection Rate : 0.345
## Detection Prevalence : 0.541
## Balanced Accuracy : 0.638
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2110 1555
## 1 1803 2527
##
## Accuracy : 0.58
## 95% CI : (0.569, 0.591)
## No Information Rate : 0.511
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.158
##
## Mcnemar's Test P-Value : 2.02e-05
##
## Sensitivity : 0.619
## Specificity : 0.539
## Pos Pred Value : 0.584
## Neg Pred Value : 0.576
## Prevalence : 0.511
## Detection Rate : 0.316
## Detection Prevalence : 0.542
## Balanced Accuracy : 0.579
##
## 'Positive' Class : 1
##
##
## Call:
## randomForest(formula = injury ~ position_id + age + I(age^2) + game_starter + weight_pounds + year, data = train_set)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 36.2%
## Confusion matrix:
## 0 1 class.error
## 0 5824 3306 0.362
## 1 3441 6084 0.361
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2496 1483
## 1 1417 2599
##
## Accuracy : 0.637
## 95% CI : (0.627, 0.648)
## No Information Rate : 0.511
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.274
##
## Mcnemar's Test P-Value : 0.227
##
## Sensitivity : 0.637
## Specificity : 0.638
## Pos Pred Value : 0.647
## Neg Pred Value : 0.627
## Prevalence : 0.511
## Detection Rate : 0.325
## Detection Prevalence : 0.502
## Balanced Accuracy : 0.637
##
## 'Positive' Class : 1
##
From the results above, we see that the logistic regression obtained the highest accuracy of 0.638, with the random forest and kNN attaining accuracies of 0.637 and 0.580, respectively. The ROC curves and corresponding AUC values are presented below.
auc(roc_logi) # calculate AUC for each ROC curve
## Area under the curve: 0.701
auc(roc_knn)
## Area under the curve: 0.608
auc(roc_rf)
## Area under the curve: 0.693
We now will compare the three models based on Accuracy, AUC, and Sensitivity/Specificity:
Accuracy
We see that the logistic regression model has the highest accuracy.
AUC
We also see that the logistic regression model has the highest AUC. Now, since it is clear that kNN is the poorest performing model, we will only consider the logistic regression and random forest going forward.
Logistic Regression:
Random Forest:
The random forest is slightly more balanced in terms of sensitivity/specificity. However, the logistic regression attains a higher sensitivity.
All things considered, the logistic regression has the highest accuracy and the highest AUC of all 3 models. It also provides us with clearly interpretable odds ratios for the risk of injury. For these reasons, we will choose logistic regression as the best model for our situation.
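Since interpretability drove this choice, it is worth noting how the odds ratios are obtained: by exponentiating the fitted logistic-regression coefficients. A sketch on built-in data:

```r
# Odds ratios are the exponentiated logistic-regression coefficients;
# values below 1 indicate reduced odds per unit increase in the covariate
fit <- glm(am ~ wt, family = binomial(), data = mtcars)
odds_ratios <- exp(coef(fit))
odds_ratios
```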
We are also interested in understanding which injuries are most common in the NFL; we begin by simply charting the distribution of injuries, using the very granular injury type classifications available from our data source.
We see here that looking at each specific injury does give us information, but beyond roughly the first half of the injury list the graph is not very informative. We can conclude from this graph, however, that knee, ankle, and hamstring injuries were the most common injuries among NFL football players. After this initial EDA, injuries were grouped into the more general body-part categories of head, shoulder, upper torso, lower torso, arm, hand, leg, and foot. This allowed us to analyze the distribution of injuries more clearly and come to meaningful conclusions, as can be seen in the R Shiny app provided in the EDA files and on the project website.
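Once injuries are re-coded into body-part categories, the distribution can be tallied with a pattern like this (toy strings for illustration):

```r
# Tally injury mentions by body-part category from re-coded strings
recoded <- c("leg arm head head", "leg foot", "head")
counts  <- sort(table(unlist(strsplit(recoded, " "))), decreasing = TRUE)
counts
```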
We first consider the total injuries per year over the entire 2009-2021 period, as well as the max injuries that a team had in a single season over this period.
It is clear that the Houston Texans have the most total injuries over this period, as well as the most injuries in a single season. Most teams had between 600 and 700 injuries over this period, and the Texans’ 170 injuries in a single season (2011) is a full 20% higher than the second-highest single-season total during this time frame (142 for the Cleveland Browns in 2012).
Let’s take a look at the distribution of injury counts over each year in the 2009-2021 period, as well as the distribution of injury counts as shown in a boxplot.
From the final chart it appears that the Texans’ large number of injuries is concentrated in the 2011-2013 range. There is a sizable spread in the median injuries per season by team; the Kansas City Chiefs had a median of 35 injuries per year compared to a whopping 85 for the Texans. So while the Texans’ very high total over this period is driven by a few large outliers in 2011-2013, their high median suggests that even in a typical season they have more injuries than most other teams.
Here, we will explore trends over time in NFL injuries.
First, we will create a simple plot of the number of injuries recorded each season.
It looks like there have been fewer injuries in recent years, but we can explore this data much further.
We can also look at the typical number of injuries players get in a season:
It is important to note that these data only include players who were injured at least once, but they are helpful in understanding the typical distribution of the number of injuries per person. Among injured players, the median number of injuries per player was just one per season, but a few unlucky players experienced more than 10 injuries in a single season.
We can also break down these plots by teams, similarly to how we explored this in question 3:
The decreasing trend in injury counts appears to hold true for all teams.
Concussions in the NFL have received significant attention in the last several years due to concerns regarding CTE and permanent brain damage. The NFL has worked to reduce the frequency of concussions among its players. Has there been a decline in the number of concussions each season since 2009?
While there is significant variation over time in this plot, it appears that there is a downward trend over time in the number of concussions each season.
Does this trend hold true across all teams?
We see huge variation in this plot, both between teams and from year to year, with somewhere between 0 and 20 players experiencing concussions per team per season. However, we still see what appears to be a general decline in concussions over time.
Now that we have explored concussions more closely, let’s look at trends for all injury types.
Interestingly, we see similar trends across all eight injury types. Note that the y-axis is displayed on the log scale to improve readability by spreading out the lower lines on this plot.
We can visualize this information in a barplot as well, which is even easier to interpret:
The stacked barplot allows us to see the cumulative trends in injuries over time, as well as the breakdown by injury type.
While all of the previous plots show the same trend in injury counts over time, none of them provide any insight into the severity of each injury. Lastly, we want to explore the average number of games missed due to injury per player per season, as well as the average number of games injured per player per season, which includes players who play through injuries. These plots are shown below:
The plots actually show entirely different trends, highlighting the importance of exploring any data set extensively before drawing conclusions about it. In the first plot, we see a spike in the number of games missed due to injury from 2016-2018, which corresponds with a sharp drop in raw injury counts in the other plots. It is possible that in these years players had fewer injuries than in prior years but that these injuries were more severe, leading to more games missed per injury. It is also possible that new rules force injured players to sit out of games even when they want to play through their injuries. The second plot depicts a more consistent trend in games played while injured, with a steep drop-off in 2019. It is interesting to note that the trends in these two plots match reasonably well from 2016 to 2020, with the number of games injured only slightly higher than the number of games missed due to injury. Before 2016, it appears that there were many more games played while injured, because the number of games injured is much higher than the number of games missed. This may be due to changes in NFL rules that prevent injured players from injuring themselves further by continuing to play.
mean(injuries$total_injuries)
## [1] 1.59
var(injuries$total_injuries)
## [1] 1.02
# we actually have under-dispersion
poisson_fit <- glm(total_injuries ~ year, family = poisson, data = injuries)
summary(poisson_fit)
##
## Call:
## glm(formula = total_injuries ~ year, family = poisson, data = injuries)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.702 -0.540 -0.349 0.314 8.031
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 67.26351 4.04241 16.6 <2e-16 ***
## year -0.03317 0.00201 -16.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 7024.5 on 14589 degrees of freedom
## Residual deviance: 6749.2 on 14588 degrees of freedom
## AIC: 40298
##
## Number of Fisher Scoring iterations: 4
We attempted to model the declining injury counts with this Poisson regression, but the approach is not reliable here: because our data include only players who were injured at least once, we lack a true count of players with zero injuries (and, as the mean and variance above show, the counts are under-dispersed relative to a Poisson).